Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
What is this about?
Text to speech
VALL-E1 is an example of a text-to-speech (TTS) model, the sort of thing that gets used when Siri or your GPS talks to you. The input is text, the output is speech.
TTS has a long history and modern versions can sound quite good, but the generated voice comes from one of a small set of pretrained voices that are baked in.
Let’s listen to some examples. First, here is a human reading this sentence:
Thus did this humane and right-minded father comfort his unhappy daughter; and her mother embracing her again did all she could to soothe her feelings.
Here are some examples from some of the best commercially available TTS systems:
Google Text to Speech (neural2)
ElevenLabs (Arnold)
Coqui (Alberto)
Voice cloning
Instead of doing TTS with a specific, pretrained voice, VALL-E does voice cloning: given an example of a person’s voice, do TTS that sounds like that person. In the past, this has typically required tens to hundreds of hours of high-quality speech data.
Here are some examples of what that can do. The quality depends a lot on the input quality of the voice recording being cloned.
Voice conversion
For completeness, I will mention a related thing that people do: voice conversion. This is where you take a recording of someone talking (or singing) and change the voice to sound like someone else. You go directly from speech to speech. This has the advantage of preserving the prosody (pacing and expression) of the original.
It’s probably harder to get into, since software for it is less common. VALL-E doesn’t do it.
What’s new about VALL-E?
Incredibly, VALL-E can clone a voice with as little as three seconds of input. This is for a voice it has not seen during training (hence, “zero shot”2).
With those three seconds, you get speech that sounds recognizably like the person being cloned, but the quality is not spectacular.[Update: I listened to the samples again, and they are better than I remember them. Some of the prompts are poor, surprisingly.] It’s one of those things where it’s amazing that it works at all, rather than being amazing for how good it is. Given that TTS in general can be excellent, it seems reasonable to expect that the quality will improve with further work.
Why clone?
At this point, you may be thinking that there are several obvious nefarious uses for voice cloning (you’re so evil!), but there are also several legitimate reasons to work on it:
- personalized voice assistants using the voice of a loved one
- speech restoration for individuals who have lost their ability to speak
- in the film and gaming industries, to create or modify the dialog of characters
- to improve speech enhancement systems that pick a voice out of interfering sounds but create artifacts in doing so
What does it sound like?
See the VALL-E demo page. Unfortunately there’s no pretrained model available for us to play with.
How does VALL-E work?
VALL-E uses the techniques of language models (like ChatGPT). It can do this because both speech and written text can be mapped to phonemes (the smallest units of speech that distinguish one word from another), which bridges spoken words and written text.
In a way described below, the acoustic codes derived from the 3-second sample are used as a prompt to generate more acoustic codes, conditioned on the phonemes derived from the text.
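As a toy illustration of the text side of this bridge, here is a sketch of converting text to phonemes via a lookup table. The entries below are made up for illustration; real systems use a pronunciation dictionary like CMUdict or a learned grapheme-to-phoneme model.

```python
# Toy grapheme-to-phoneme conversion: map each word to a phoneme
# sequence via a tiny hand-made dictionary (hypothetical entries).
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    """Flatten a sentence into one phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes

print(text_to_phonemes("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```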
I found it useful to map the terminology of language models into the terms VALL-E uses.
| Language model | VALL-E |
|---|---|
| system prompt | acoustic prompt |
| tokens | phonemes |
| embeddings | discrete acoustic codes (sometimes confusingly called tokens) |
| n/a | conditioning |
The main components of VALL-E are:
- EnCodec
- Residual vector quantization (RVQ)
- An autoregressive (AR) transformer
- A non-autoregressive (NAR) transformer
EnCodec
A lot of the heavy lifting in VALL-E is done by FAIR’s EnCodec. It provides a new way to do audio compression using neural networks. Audio samples are here. I haven’t quite figured out EnCodec—it does well in their subjective and objective evaluations, but the samples sound pretty artifact-y to me. I think the audio quality of VALL-E has a lot to do with EnCodec.
In VALL-E, EnCodec is used to (1) create embeddings of the audio prompt and audio training data, and then (2) to turn predicted embeddings back into audio.
Residual Vector Quantization
The general idea of vector quantization (VQ) is to cluster the embeddings and find the centroids of the clusters. The centroids are labeled, forming a codebook. A given embedding vector is assigned the label of the closest centroid. For VALL-E, the centroids are learned during training.
The net result is that the embedding vectors that come from the EnCodec latent space are transformed into a “vocabulary” of discrete symbols that are amenable to language modeling.
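Here is a minimal sketch of the nearest-centroid lookup, using 2-D vectors and a hand-picked codebook (in VALL-E the codebook entries are learned and the vectors are EnCodec latents; these numbers are invented for illustration):

```python
import math

# A "codebook" of centroids; real codebooks are learned, not hand-picked.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def quantize(vec, codebook):
    """Return (index, centroid) of the codebook entry closest to vec."""
    idx = min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))
    return idx, codebook[idx]

idx, centroid = quantize((0.9, 0.2), CODEBOOK)
print(idx, centroid)  # 1 (1.0, 0.0)
```

The index `1` is the discrete symbol; the continuous vector is gone, and only the label travels onward to the language model.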
With residual vector quantization (RVQ), you first do VQ on an embedding vector, then apply VQ (with a different codebook) to the residual: the difference between the original vector and the centroid. You can repeat this process, and VALL-E does, for a total of 8 VQ operations for each embedding. The terminology for the output of each VQ step slips around a bit, but let’s stick with “audio code”. The first-level audio code is the “primary” one, the others are “residuals”.
The general idea is that the primary audio codes capture the linguistic content while the residuals capture the speaker characteristics. I don’t think there is any direct evidence that this is true, but it’s a reasonable hypothesis. Because of these different interpretations, there are different models for each, described below.
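Mechanically, the RVQ loop can be sketched like this, again with 2-D toy vectors and tiny hand-made codebooks (VALL-E uses 8 learned codebooks over EnCodec latents; everything numeric here is invented for illustration):

```python
import math

def quantize(vec, codebook):
    """Nearest-centroid lookup: return (index, centroid)."""
    idx = min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))
    return idx, codebook[idx]

def rvq_encode(vec, codebooks):
    """Residual VQ: quantize, subtract the chosen centroid, quantize
    the residual with the next codebook, and so on. Returns one code
    (index) per stage."""
    codes = []
    residual = vec
    for cb in codebooks:
        idx, centroid = quantize(residual, cb)
        codes.append(idx)
        residual = tuple(r - c for r, c in zip(residual, centroid))
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the chosen centroids."""
    total = (0.0, 0.0)
    for idx, cb in zip(codes, codebooks):
        total = tuple(t + c for t, c in zip(total, cb[idx]))
    return total

# Two stages: a coarse codebook, then a finer one for the residual.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],      # stage 1 (coarse)
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.0)],  # stage 2 (residual)
]
codes = rvq_encode((1.2, 0.1), codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)  # [1, 1] (1.25, 0.0)
```

Each extra stage shrinks the reconstruction error: stage 1 alone lands at (1.0, 0.0), and adding the stage-2 centroid moves the reconstruction to (1.25, 0.0), closer to the original (1.2, 0.1).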
The AR transformer
The autoregressive (AR) transformer is like a GPT language model: it is a decoder-only transformer that predicts the next primary audio code given the previous codes, and is conditioned on the text phonemes and the audio prompt.
It is trained like a regular GPT model to generate acoustic codes, with teacher forcing on the ground-truth codes. To create training data, they process Libri-Light into pairs of acoustic-code data and phoneme data, derived from the audio and the text transcript respectively. The model gets the phonemes from the transcript as conditioning, and the loss (cross-entropy over the codebook) compares each predicted acoustic code to the ground-truth code derived from the audio.
How does it get started? The 3-second acoustic prompt provides the first few acoustic codes.
The intuition is that because the primary audio codes are related to the linguistic content, they should be handled like a text generation task.
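The generation loop can be sketched like any GPT-style sampler. The `next_code_distribution` function below is a made-up stand-in for the trained transformer (here it just returns a uniform distribution); in the real model, the conditioning on phonemes and prompt happens inside it.

```python
import random

VOCAB = 1024  # each EnCodec codebook has 1024 entries

def next_code_distribution(phonemes, codes_so_far):
    """Stand-in for the AR transformer: a probability for each possible
    next primary acoustic code. Here it's just uniform."""
    return [1.0 / VOCAB] * VOCAB

def generate(phonemes, prompt_codes, n_steps, seed=0):
    """Autoregressive sampling: the 3-second prompt supplies the first
    codes, then each new code is sampled from the predicted
    distribution and fed back in as context."""
    rng = random.Random(seed)
    codes = list(prompt_codes)
    for _ in range(n_steps):
        probs = next_code_distribution(phonemes, codes)
        code = rng.choices(range(VOCAB), weights=probs)[0]
        codes.append(code)
    return codes

out = generate(phonemes=["HH", "AH"], prompt_codes=[17, 512, 33], n_steps=5)
print(len(out))  # 8: 3 prompt codes + 5 generated
```

Note that `rng.choices` samples from the distribution rather than taking the argmax, which is why repeated runs on the same input give different output.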
The NAR transformer
The non-autoregressive transformer (NAR) is more like a language translation model (encoder-decoder transformer) where the entire input is processed at once. It is applied to the residual audio codes. The intuition is that the residuals relate to the speaker characteristics that are sort of independent of what is being said, so there is less dependence on previously generated codes, and they can be processed all at once.
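A sketch of the difference in shape: the AR model emits one code per time step, while the NAR model fills in a whole level of residual codes in one pass, one pass per level. The `predict_level` function is a made-up stand-in for the trained NAR transformer (a dummy rule instead of a real network):

```python
def predict_level(phonemes, primary_codes, level):
    """Stand-in for the NAR transformer: given the full sequence of
    primary codes, predict the residual code at `level` for every
    time step at once."""
    return [(c + level) % 1024 for c in primary_codes]  # dummy rule

def nar_fill(phonemes, primary_codes, n_levels=7):
    """Residual levels 2..8 are each produced in one full-sequence pass."""
    all_levels = [list(primary_codes)]
    for level in range(1, n_levels + 1):
        all_levels.append(predict_level(phonemes, primary_codes, level))
    return all_levels  # 8 levels x sequence length

levels = nar_fill(["HH", "AH"], [17, 512, 33])
print(len(levels), len(levels[0]))  # 8 3
```

With all 8 levels filled in, the full stack of codes can be handed to the EnCodec decoder to produce audio.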
Some questions I had
How novel is it?
I thought maybe this is the first time that anyone had used language modeling for TTS, but I think Google’s AudioLM has some of the same ideas. Still, it’s a very interesting new direction that seems to have a lot of potential.
Is a 3-second prompt the only option?
They also tested with 5 and 10 second prompts. No surprise—it works a bit better with longer prompts.
What’s up with the “continual” version?
I think this means the acoustic prompt comes from speech whose transcript matches the beginning of the target text. For VALL-E-continual, you could give the voice-prompt person the script. They read the first 3 seconds, and that becomes the voice prompt. The text input starts after the 3 seconds that the person spoke.
This gives semantic continuity to the whole process. It seems to help a bit.
What determines the speed of the generated speech?
This is not described explicitly in the paper and I’m not completely clear on this. It probably happens in the AR model. The acoustic prompt will have a certain speaking speed which is reflected in the number of acoustic codes for each phoneme.
Acoustic codes are created at a fixed rate of 75 Hz, so about 13 ms each. Phonemes can be from 5 to 670 ms, so typically there are many more acoustic codes than phonemes.
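The arithmetic behind those numbers, as a quick sanity check:

```python
CODE_RATE_HZ = 75  # EnCodec emits 75 code frames per second

def frames_for(duration_ms):
    """Number of 75 Hz code frames spanned by a phoneme of this length."""
    return duration_ms * CODE_RATE_HZ / 1000

print(frames_for(5))    # 0.375 -> a 5 ms phoneme is shorter than one frame
print(frames_for(670))  # 50.25 -> a 670 ms phoneme spans ~50 frames
```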
The loss function is based on acoustic codes. If it included a term to compare the generated speech with the training samples, could that improve the audio quality?
Good idea. They should try that. 🙂
It’s a bit tricky though because most audio comparisons require sample-aligned inputs and outputs. Due to the vagaries of the timing of the generated speech, such a comparison might be difficult.
If you run VALL-E again on the same text, the output will be different. Why?
Like GPT text models, it is because the AR model takes random samples from the predicted distribution of the next acoustic code.
Why do they train their own ASR?
Probably because they need alignment data. During training, the loss function compares the generated acoustic codes with ground truth codes derived from voice recordings that are paired with transcriptions. Because the transcriptions are processed on phoneme boundaries, the acoustic data needs to be aligned with those boundaries. This is the reason for the “forced alignment” that was mentioned.
Is RVQ disentangling?
I don’t have an answer. I think the authors are implying that the primary code and the residual codes disentangle linguistic information from speaker characteristics. Are they saying that? How would you go about finding evidence for or against that hypothesis?
Footnotes
The name is an homage to DALL-E which is a mashup of WALL-E and Dali. I wouldn’t say that VALL-E makes a whole lot of sense as a name.↩︎
By contrast, “few-shot” would mean it hadn’t seen the voice during training, but was given a few examples of the new voice, a corresponding transcription, and the desired output. This would be a kind of fine-tuning.↩︎